Video-Action Model
Vision Language Action Model
Video Foundation Model
Vision Language Model
Learning from Video
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
https://arxiv.org/abs/2512.15692
egoverse
https://egoverse.ai/
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
https://huggingface.co/papers/2603.23481
Video-Action Models excel at visual reasoning in long-horizon tasks, but vision alone is insufficient for manipulation where contact matters.